Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full parser for license expressions (version II) #29

Closed
wants to merge 3 commits into from

Conversation

bossmc
Copy link

@bossmc bossmc commented May 28, 2019

I saw #5, which adds a full blown parser to this crate in order to resolve #3 and to add support for DocumentRef-<blah>:LicenseRef-<blah>. It seems that PR has been somewhat abandoned (last touched in 2016), and doesn't actually support returning the parsed license expression.

This PR is similar in structure to that one, adding a lalrpop parser to handle the expression format, but has a few differences:

  • Uses a custom lexer to simplify the structure of the parser (avoids the ambiguities in SPDX license expression parser using lalrpop library. #5 between ExceptionIDs and LicenseIDs)
  • Parses the expression into an actual object for later analysis (using borrows to prevent unnecessary string allocations)
  • Handles nested OR and AND statements as per the specification (with AND at a higher precedence than OR)

This does introduce a few breaking changes:

  • Errors are presented as failure::Error (subsuming Use failure for error handling #20 as well)
  • Some license expressions that used to (spuriously) parse as valid are now rejected e.g. MIT WITH or MIT AND - I assume this is fine, as the old behaviour was simply incorrect

And finally it does introduce one technical bug, the SPDX specification states:

There MUST NOT be white space between a license-id and any following "+". This supports easy parsing and backwards compatibility

Ironically (?) this is harder to implement in most parser toolchains, since most lexers ignore whitespace between tokens (including lalrpop's default one, and my custom one). Currently this means that the code in this PR will accept MIT + as meaning the same as MIT+. I think this is fixable by making the lexer more complicated, but I've not tried and I worry that it'll make the code less maintainable. Thoughts?

Finally, I've not done anything around Cargo's slightly non-compliant use of the specification (NOASSERTION and NONE), since those can be handled out of band if Cargo wishes (since there's no need for a parser to determine if the expression is one of those special cases). Again I'm interested in your thoughts on this - while I'm doing this work in the context of some non-OSS code, I can tell you that Cargo is no longer the only user of this crate 😀.

* Use a custom Lexer to handle License vs Exception vs Reference
* Don't copy strings, slice them from the input
* Handle whole ABNF from https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60
@bossmc
Copy link
Author

bossmc commented Sep 12, 2019

@withoutboats - You seem to be the main developer on this codebase historically, could you have a look at this PR please?

@withoutboats
Copy link
Collaborator

I no longer maintain this code, but I'm pretty dubious of adding lalrpop (which is a heavy dependency that takes quite a bit of time to compile) to parse license expressions, which have a very simple grammar. This will seriously degrade the compile time of any users who aren't already depending on lalrpop.

Jake-Shadle added a commit to EmbarkStudios/spdx that referenced this pull request Sep 17, 2019
This is basically the same lexer introduced
in ehuss/license-exprs#29
@carols10cents
Copy link
Contributor

Doing some cleanup; it looks like this change is stale and not what's ultimately desired anyway, so I'm closing this PR. Thank you!

@bossmc
Copy link
Author

bossmc commented Sep 27, 2019

@carols10cents - Who should I be talking to about the correct fix here? I read @withoutboats's suggestion but, as he's no longer a maintainer, I was hoping to get an opinion from one of the current maintainers before starting to re-work the change. I understand the worry about the dependency explosion, but I think most of these dependencies are already present in Cargo (and I don't believe that writing a parser by hand is a good idea, it's likely to be buggy and the lalrpop generator "just works":tm:)

As a heads-up, there seems to be an alternative crate for handling SPDX being created (including some of this very change) at https://github.com/EmbarkStudios/spdx, they've hand-rolled a Shunting-Yard-based parser which I'm sure they'd be happy to share if that's a preferred solution?

Jake-Shadle added a commit to EmbarkStudios/spdx that referenced this pull request Oct 3, 2019
* Use a better lexer, taken from ehuss/license-exprs#29
* Add a simple parser/evaluator
* Add an update exe for pulling new SDPX information, also taken from https://github.com/rust-lang-nursery/license-exprs
* Add support for some of the metadata available from the SPDX format, namely "IsDeprecated", "IsFSFLibre", "IsOSIApproved"

Resolves: #1
@Jake-Shadle
Copy link

We'd be happy to share it or add more maintainers if people are interested, as @withoutboats mentioned, the expressions are rather simple and hand parsing wasn't difficult, though if you find any test cases that aren't correct I would love to incorporate them!

@withoutboats
Copy link
Collaborator

@Jake-Shadle this crate doesn't handle license expressions with parens in them (as in (MIT OR Apache-2.0) AND GPL-3.0+. But matching parens nesting should be a feature that can be added without depending on a parser generator that will increase the build time of the library by orders of magnitude.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parens for setting precedence
4 participants